Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: 📝 pseudo code and docstring for write_resource_parquet() #816

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

lwjohnst86
Copy link
Member

@lwjohnst86 lwjohnst86 commented Oct 25, 2024

Description

Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. But instead of pseudocode, I will write an outline of the Python function with how I think it might flow inside. Plus, I can write the full docstrings inside, so you all don't need and we don't need to move it over from the Quarto doc. I have NOT ran this, tested it, or did any execution, this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.

Closes #642

This PR needs an in-depth review.

Checklist

  • Updated documentation

@lwjohnst86 lwjohnst86 requested a review from a team as a code owner October 25, 2024 02:32
docs/design/implementation/python-functions.qmd Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
Copy link
Contributor

@martonvago martonvago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!! Just some questions.

I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...

sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
@lwjohnst86
Copy link
Member Author

@martonvago I forgot to respond to your initial question.

Raw files are kept from the initial upload to keep a record just in case something happens.

A potential scenario might be, a first round of surveys are sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out and that data gets uploaded. That's another raw file. So those two raw files get merged together and saved as the data.parquet file, which would be the file that researchers actually use to do analyses.

Copy link
Member

@signekb signekb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall picture of this makes sense to me as well 👍

sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
sprout/core/write_resource_parquet.py Outdated Show resolved Hide resolved
@lwjohnst86 lwjohnst86 marked this pull request as draft November 13, 2024 14:13
@lwjohnst86 lwjohnst86 marked this pull request as ready for review February 13, 2025 08:23
@@ -127,13 +127,21 @@ flowchart
function --> out
```

### {{< var wip >}} `write_resource_parquet(raw_files, path)`
### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure of the naming here. And I'm unsure if it should output a DataFrame and have another function write_resource_parquet() that does the writing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if that DataFrame output is used somewhere else, have 2 functions, otherwise have one function that does the writing as well?
build or create sounds okay to me.

Comment on lines +9 to +13
While Sprout generally assumes
that the files stored in the `resources/raw/` folder have already been
verified and validated, this function does some quick verification checks
of the data after reading it into Python from the raw file(s) by comparing
with the current properties given by the `resource_properties`. All data in the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
While Sprout generally assumes
that the files stored in the `resources/raw/` folder have already been
verified and validated, this function does some quick verification checks
of the data after reading it into Python from the raw file(s) by comparing
with the current properties given by the `resource_properties`. All data in the
While Sprout generally assumes
that the files stored in the `resources/raw/` folder are already correctly
structured and tidy, it still runs checks to ensure the data are correct
by comparing to the properties. All data in the

Copy link
Contributor

@martonvago martonvago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very sensible to me 😁

@@ -127,13 +127,21 @@ flowchart
function --> out
```

### {{< var wip >}} `write_resource_parquet(raw_files, path)`
### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if that DataFrame output is used somewhere else, have 2 functions, otherwise have one function that does the writing as well?
build or create sounds okay to me.


If there are any duplicate observation units in the data, only the most recent
observation unit will be kept. This way, if there are any errors or mistakes
in older raw files that has been corrected in later files, the mistake can still
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in older raw files that has been corrected in later files, the mistake can still
in older raw files that have been corrected in later files, the mistake can still

Comment on lines +27 to +31
sp.write_resource_parquet(
raw_files_path=sp.path_resources_raw_files(1, 1),
parquet_path=sp.path_resource_data(1, 1),
properties_path=sp.path_package_properties(1, 1),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be updated?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Create internal flow diagram for write_resource_parquet()
3 participants